Introduction

The blood sample data comes from patients in the Wuhan region of China and was collected between 10 January and 18 February 2020. The data was originally collected to help identify crucial predictive biomarkers of disease mortality. More information can be found in the article by Tan et al.

The analysis first summarizes what we know about the patients. It then examines whether a patient’s death can be predicted from the available sample results.

Since the dataset contains a multitude of biomarkers, the classification focuses on the three that Tan et al. highlight in their article:

  • lactic dehydrogenase (LDH),
  • lymphocyte,
  • and high-sensitivity C-reactive protein (hs-CRP).

Tan et al. report that these three biomarkers allowed them to predict the mortality of individual patients more than 10 days in advance with more than 90% accuracy.

In the original dataset, each row contains basic information about the patient, the timestamp of a blood sample, and the results for only a few biomarkers; the remaining biomarkers in the row are empty. For some parts of the analysis the dataset therefore has to be adjusted. We assume that, when a row has no value for a biomarker, the best approximation is the most recent earlier value of that biomarker for the same patient. If the patient has no earlier sample with that biomarker, the first later value is used. If the patient has no value for the biomarker at all, the median of the whole dataset is used. Please bear this in mind, as the approach is probably not ideal and can skew some of the results of the analysis.
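The fallback rules above can be sketched as follows. This is only an illustration in Python, not the code used for the analysis: the function name `impute_biomarker`, the toy values, and the choice to compute the dataset median from observed (non-missing) values are all assumptions.

```python
from statistics import median


def impute_biomarker(samples, dataset_median):
    """Fill gaps in one patient's chronologically ordered values of one biomarker.

    `samples` is a list of floats, with None marking a missing value.
    """
    filled = list(samples)

    # Rule 1: carry the most recent earlier value forward.
    last_seen = None
    for i, value in enumerate(filled):
        if value is None:
            filled[i] = last_seen
        else:
            last_seen = value

    # Rule 2: gaps before the first measurement take the first later value.
    next_seen = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_seen
        else:
            next_seen = filled[i]

    # Rule 3: the patient was never measured, so use the dataset-wide median.
    return [dataset_median if value is None else value for value in filled]


# Invented toy values for a single biomarker, three hypothetical patients.
patients = {
    "A": [None, 300.0, None, 410.0],
    "B": [None, None],
    "C": [250.0, None, None],
}
# Assumption: the median is taken over the observed values only.
observed = [v for values in patients.values() for v in values if v is not None]
biomarker_median = median(observed)

imputed = {pid: impute_biomarker(values, biomarker_median)
           for pid, values in patients.items()}
print(imputed)
```

Patient B illustrates the weakest case: every value is replaced by the same dataset-wide median, which is one way the imputation can skew later results.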

Fourteen samples, each belonging to a different patient, have no registration date; each is the only sample for its patient. For these samples, the patient’s admission time is used as the sample registration date.

Dataset overview

The original dataset contains 81 columns and 6120 rows. Each row corresponds to a blood sample result. The data concerns 375 patients. For each patient there are multiple sample results.

Let’s take a look at the information we have about the patients.

statistic  patient_id  age    admission_time       discharge_time       hospitalized_days
Min.       1.0         18.00  2020-01-10 15:52:20  2020-01-23 09:09:23   0.0847
1st Qu.    94.5        46.00  2020-02-01 19:27:40  2020-02-11 13:39:21   4.4845
Median     188.0       62.00  2020-02-04 22:30:34  2020-02-16 17:40:07   9.5942
Mean       188.0       58.83  2020-02-04 20:13:51  2020-02-15 16:42:59  10.8536
3rd Qu.    281.5       70.00  2020-02-10 04:11:10  2020-02-19 11:47:14  15.6876
Max.       375.0       95.00  2020-02-17 21:30:07  2020-03-04 16:21:51  35.1708

gender: male 224, female 151
death:  no 201, yes 174
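The text does not say how hospitalized_days is derived. A plausible reading, assumed here, is the time between admission and discharge expressed in days. The sketch below (the helper name is hypothetical) applies that formula to the mean admission and discharge times from the summary; because the mean of the differences equals the difference of the means, the result should match the reported mean stay.

```python
from datetime import datetime


def hospitalized_days(admission, discharge):
    """Length of stay in days, assuming it is discharge time minus admission time."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(discharge, fmt) - datetime.strptime(admission, fmt)
    return delta.total_seconds() / 86400  # 86400 seconds per day


# Mean admission and discharge times taken from the summary above.
mean_stay = hospitalized_days("2020-02-04 20:13:51", "2020-02-15 16:42:59")
print(round(mean_stay, 4))  # 10.8536, the reported mean of hospitalized_days
```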

According to this data, males are more likely to die. Please note that the dataset contains fewer female patients (151) than male patients (224).

Older patients appear to be more likely to die.

A significant number of patients die shortly after being hospitalized. Most likely, these are patients who were admitted in critical condition.

And here’s a short summary of all of the available attributes (patient info and biomarkers) before the cleaning of the dataset for further analysis:

patient_id re_date age gender admission_time discharge_time death hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time procalcitonin eosinophils interleukin_2_receptor alkaline_phosphatase albumin basophil interleukin_10 total_bilirubin platelet_count monocytes antithrombin interleukin_8 indirect_bilirubin red_blood_cell_distribution_width neutrophils total_protein quantification_of_treponema_pallidum_antibodies prothrombin_activity h_bs_ag mean_corpuscular_volume hematocrit white_blood_cell_count tumor_necrosis_factor_u_03b1 mean_corpuscular_hemoglobin_concentration fibrinogen interleukin_1ss urea lymphocyte_count ph_value red_blood_cell_count eosinophil_count corrected_calcium serum_potassium glucose neutrophils_count direct_bilirubin mean_platelet_volume ferritin rbc_distribution_width_sd thrombin_time x_lymphocyte hcv_antibody_quantification d_d_dimer total_cholesterol aspartate_aminotransferase uric_acid hco3 calcium amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp lactate_dehydrogenase platelet_large_cell_ratio interleukin_6 fibrin_degradation_products monocytes_count plt_distribution_width globulin x_u_03b3_glutamyl_transpeptidase international_standard_ratio basophil_count x2019_n_co_v_nucleic_acid_detection mean_corpuscular_hemoglobin activation_of_partial_thromboplastin_time high_sensitivity_c_reactive_protein hiv_antibody_quantification serum_sodium thrombocytocrit esr glutamic_pyruvic_transaminase e_gfr creatinine
Min. : 1.0 Min. :2020-01-10 19:45:00 Min. :18.00 Min. :1.000 Min. :2020-01-10 15:52:20 Min. :2020-01-23 09:09:23 Min. :0.0000 Min. : 1.9 Min. : 6.4 Min. : 71.50 Min. : 11.50 Min. : 0.020 Min. :0.000 Min. : 61.0 Min. : 17.00 Min. :13.60 Min. :0.00 Min. : 5.00 Min. : 2.50 Min. : -1.0 Min. : 0.300 Min. : 20.00 Min. : 5.000 Min. : 0.100 Min. :10.60 Min. : 1.7 Min. :31.80 Min. : 0.020 Min. : 6.00 Min. : 0.000 Min. : 61.60 Min. :14.50 Min. : 0.13 Min. : 4.00 Min. :286.0 Min. : 0.500 Min. : 5.00 Min. : 0.800 Min. : 0.000 Min. :5.000 Min. : 0.100 Min. :0.000 Min. :1.650 Min. : 2.760 Min. : 1.000 Min. : 0.06 Min. : 1.600 Min. : 8.50 Min. : 17.8 Min. : 31.30 Min. : 13.00 Min. : 0.000 Min. :0.020 Min. : 0.210 Min. :0.100 Min. : 6.00 Min. : 43.0 Min. : 6.30 Min. :1.170 Min. : 5 Min. : 110.0 Min. :11.20 Min. : 1.500 Min. : 4.00 Min. : 0.010 Min. : 8.00 Min. :10.10 Min. : 3.00 Min. : 0.840 Min. :0.000 Min. :-1 Min. :20.4 Min. : 21.80 Min. : 0.10 Min. :0.05 Min. :115.4 Min. :0.010 Min. : 1.00 Min. : 5.00 Min. : 2.00 Min. : 11.00
1st Qu.: 92.0 1st Qu.:2020-02-04 13:46:00 1st Qu.:47.00 1st Qu.:1.000 1st Qu.:2020-02-01 00:06:16 1st Qu.:2020-02-13 19:06:26 1st Qu.:0.0000 1st Qu.: 4.4 1st Qu.:113.0 1st Qu.: 99.05 1st Qu.: 13.60 1st Qu.: 0.040 1st Qu.:0.000 1st Qu.: 459.5 1st Qu.: 54.00 1st Qu.:27.40 1st Qu.:0.10 1st Qu.: 5.00 1st Qu.: 7.40 1st Qu.:109.0 1st Qu.: 2.800 1st Qu.: 74.00 1st Qu.: 8.675 1st Qu.: 3.800 1st Qu.:12.00 1st Qu.:65.1 1st Qu.:61.00 1st Qu.: 0.040 1st Qu.: 65.00 1st Qu.: 0.000 1st Qu.: 86.90 1st Qu.:33.50 1st Qu.: 4.94 1st Qu.: 6.70 1st Qu.:333.0 1st Qu.: 3.050 1st Qu.: 5.00 1st Qu.: 4.000 1st Qu.: 0.460 1st Qu.:6.000 1st Qu.: 3.680 1st Qu.:0.000 1st Qu.:2.270 1st Qu.: 3.950 1st Qu.: 5.550 1st Qu.: 3.09 1st Qu.: 3.225 1st Qu.:10.10 1st Qu.: 377.2 1st Qu.: 38.50 1st Qu.: 15.60 1st Qu.: 3.925 1st Qu.:0.040 1st Qu.: 0.603 1st Qu.:3.010 1st Qu.: 19.50 1st Qu.: 183.2 1st Qu.:21.00 1st Qu.:1.980 1st Qu.: 150 1st Qu.: 218.0 1st Qu.:25.60 1st Qu.: 4.772 1st Qu.: 4.00 1st Qu.: 0.270 1st Qu.:11.10 1st Qu.:29.70 1st Qu.: 22.00 1st Qu.: 1.030 1st Qu.:0.010 1st Qu.:-1 1st Qu.:29.7 1st Qu.: 35.30 1st Qu.: 5.70 1st Qu.:0.07 1st Qu.:137.7 1st Qu.:0.150 1st Qu.: 14.00 1st Qu.: 16.00 1st Qu.: 63.58 1st Qu.: 58.00
Median :185.0 Median :2020-02-09 12:50:00 Median :62.00 Median :1.000 Median :2020-02-04 15:53:12 Median :2020-02-17 21:50:30 Median :0.0000 Median : 20.6 Median :125.0 Median :102.10 Median : 14.80 Median : 0.100 Median :0.100 Median : 676.5 Median : 69.50 Median :32.20 Median :0.20 Median : 5.90 Median : 10.70 Median :178.0 Median : 5.700 Median : 86.00 Median : 16.000 Median : 5.400 Median :12.60 Median :82.4 Median :65.90 Median : 0.050 Median : 81.00 Median : 0.010 Median : 90.10 Median :36.60 Median : 7.72 Median : 8.60 Median :343.0 Median : 4.120 Median : 5.00 Median : 5.985 Median : 0.800 Median :6.500 Median : 4.140 Median :0.010 Median :2.360 Median : 4.410 Median : 6.990 Median : 5.85 Median : 4.800 Median :10.80 Median : 711.0 Median : 40.90 Median : 16.80 Median :11.450 Median :0.060 Median : 2.155 Median :3.630 Median : 27.00 Median : 243.7 Median :23.50 Median :2.080 Median : 585 Median : 340.0 Median :30.90 Median : 19.265 Median : 17.90 Median : 0.410 Median :12.40 Median :32.70 Median : 34.00 Median : 1.140 Median :0.010 Median :-1 Median :30.9 Median : 39.20 Median : 51.50 Median :0.09 Median :140.4 Median :0.210 Median : 28.00 Median : 24.00 Median : 87.90 Median : 76.00
Mean :184.8 Mean :2020-02-08 07:09:59 Mean :59.44 Mean :1.391 Mean :2020-02-03 18:57:56 Mean :2020-02-16 21:40:09 Mean :0.4747 Mean : 1223.2 Mean :123.1 Mean :103.14 Mean : 16.68 Mean : 1.107 Mean :0.629 Mean : 907.2 Mean : 82.47 Mean :32.01 Mean :0.21 Mean : 16.07 Mean : 16.70 Mean :184.3 Mean : 6.155 Mean : 85.32 Mean : 83.088 Mean : 6.889 Mean :13.07 Mean :77.6 Mean :65.30 Mean : 0.132 Mean : 78.55 Mean : 8.306 Mean : 90.39 Mean :36.55 Mean : 15.60 Mean : 11.58 Mean :342.8 Mean : 4.294 Mean : 6.51 Mean : 9.589 Mean : 1.017 Mean :6.484 Mean : 9.288 Mean :0.039 Mean :2.355 Mean : 4.509 Mean : 8.889 Mean : 7.81 Mean : 9.887 Mean :10.91 Mean : 1379.1 Mean : 42.44 Mean : 18.17 Mean :15.392 Mean :0.117 Mean : 7.943 Mean :3.689 Mean : 46.53 Mean : 276.1 Mean :23.14 Mean :2.078 Mean : 3669 Mean : 474.2 Mean :31.77 Mean : 112.308 Mean : 61.35 Mean : 0.526 Mean :13.01 Mean :33.24 Mean : 55.34 Mean : 1.313 Mean :0.017 Mean :-1 Mean :31.0 Mean : 41.52 Mean : 76.24 Mean :0.10 Mean :141.6 Mean :0.212 Mean : 33.69 Mean : 38.86 Mean : 81.56 Mean : 109.93
3rd Qu.:270.0 3rd Qu.:2020-02-13 10:36:00 3rd Qu.:71.00 3rd Qu.:2.000 3rd Qu.:2020-02-09 02:06:58 3rd Qu.:2020-02-19 13:30:26 3rd Qu.:1.0000 3rd Qu.: 223.8 3rd Qu.:137.0 3rd Qu.:105.65 3rd Qu.: 16.70 3rd Qu.: 0.405 3rd Qu.:0.800 3rd Qu.:1155.5 3rd Qu.: 95.00 3rd Qu.:36.60 3rd Qu.:0.30 3rd Qu.: 12.35 3rd Qu.: 16.77 3rd Qu.:248.0 3rd Qu.: 8.600 3rd Qu.: 97.00 3rd Qu.: 35.200 3rd Qu.: 8.000 3rd Qu.:13.70 3rd Qu.:92.3 3rd Qu.:70.45 3rd Qu.: 0.070 3rd Qu.: 95.00 3rd Qu.: 0.010 3rd Qu.: 93.90 3rd Qu.:39.90 3rd Qu.: 12.72 3rd Qu.: 11.50 3rd Qu.:350.0 3rd Qu.: 5.480 3rd Qu.: 5.00 3rd Qu.:11.400 3rd Qu.: 1.310 3rd Qu.:7.294 3rd Qu.: 4.650 3rd Qu.:0.060 3rd Qu.:2.440 3rd Qu.: 4.870 3rd Qu.:10.260 3rd Qu.:10.95 3rd Qu.: 8.275 3rd Qu.:11.50 3rd Qu.: 1425.2 3rd Qu.: 44.70 3rd Qu.: 18.38 3rd Qu.:24.975 3rd Qu.:0.090 3rd Qu.:21.000 3rd Qu.:4.265 3rd Qu.: 42.00 3rd Qu.: 333.8 3rd Qu.:25.90 3rd Qu.:2.190 3rd Qu.: 2625 3rd Qu.: 601.8 3rd Qu.:37.20 3rd Qu.: 60.167 3rd Qu.:150.00 3rd Qu.: 0.580 3rd Qu.:14.30 3rd Qu.:36.50 3rd Qu.: 58.00 3rd Qu.: 1.330 3rd Qu.:0.020 3rd Qu.:-1 3rd Qu.:32.2 3rd Qu.: 44.12 3rd Qu.:118.50 3rd Qu.:0.11 3rd Qu.:143.5 3rd Qu.:0.270 3rd Qu.: 45.50 3rd Qu.: 41.00 3rd Qu.:103.97 3rd Qu.: 98.25
Max. :375.0 Max. :2020-02-18 17:49:00 Max. :95.00 Max. :2.000 Max. :2020-02-17 21:30:07 Max. :2020-03-04 16:21:51 Max. :1.0000 Max. :50000.0 Max. :178.0 Max. :140.40 Max. :120.00 Max. :57.170 Max. :8.600 Max. :7500.0 Max. :620.00 Max. :48.60 Max. :1.70 Max. :1000.00 Max. :505.70 Max. :558.0 Max. :53.000 Max. :136.00 Max. :6795.000 Max. :145.100 Max. :27.10 Max. :98.9 Max. :88.70 Max. :11.950 Max. :142.00 Max. :250.000 Max. :118.90 Max. :52.30 Max. :1726.60 Max. :168.00 Max. :514.0 Max. :10.780 Max. :88.50 Max. :68.400 Max. :52.420 Max. :7.565 Max. :749.500 Max. :0.490 Max. :2.790 Max. :12.800 Max. :43.010 Max. :33.88 Max. :360.600 Max. :15.00 Max. :50000.0 Max. :113.30 Max. :161.90 Max. :60.000 Max. :2.090 Max. :60.000 Max. :7.300 Max. :1858.00 Max. :1176.0 Max. :36.30 Max. :2.620 Max. :70000 Max. :1867.0 Max. :62.20 Max. :5000.000 Max. :190.80 Max. :39.920 Max. :25.30 Max. :50.60 Max. :732.00 Max. :13.480 Max. :0.120 Max. :-1 Max. :50.8 Max. :144.00 Max. :320.00 Max. :0.27 Max. :179.7 Max. :0.510 Max. :110.00 Max. :1600.00 Max. :224.00 Max. :1497.00
NA NA NA NA NA NA NA NA’s :5613 NA’s :5145 NA’s :5145 NA’s :5458 NA’s :5661 NA’s :5163 NA’s :5852 NA’s :5190 NA’s :5186 NA’s :5163 NA’s :5853 NA’s :5190 NA’s :5163 NA’s :5162 NA’s :5790 NA’s :5852 NA’s :5214 NA’s :5197 NA’s :5163 NA’s :5189 NA’s :5841 NA’s :5461 NA’s :5841 NA’s :5163 NA’s :5163 NA’s :4993 NA’s :5852 NA’s :5163 NA’s :5554 NA’s :5852 NA’s :5184 NA’s :5163 NA’s :5736 NA’s :4993 NA’s :5163 NA’s :5206 NA’s :5140 NA’s :5345 NA’s :5163 NA’s :5190 NA’s :5258 NA’s :5837 NA’s :5197 NA’s :5554 NA’s :5162 NA’s :5841 NA’s :5490 NA’s :5189 NA’s :5185 NA’s :5186 NA’s :5186 NA’s :5141 NA’s :5645 NA’s :5186 NA’s :5258 NA’s :5848 NA’s :5790 NA’s :5163 NA’s :5258 NA’s :5190 NA’s :5190 NA’s :5461 NA’s :5163 NA’s :5619 NA’s :5163 NA’s :5552 NA’s :5383 NA’s :5842 NA’s :5145 NA’s :5258 NA’s :5737 NA’s :5189 NA’s :5184 NA’s :5184

The same summary after the cleaning described in the Introduction:

patient_id re_date age gender admission_time discharge_time death hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time procalcitonin eosinophils interleukin_2_receptor alkaline_phosphatase albumin basophil interleukin_10 total_bilirubin platelet_count monocytes antithrombin interleukin_8 indirect_bilirubin red_blood_cell_distribution_width neutrophils total_protein quantification_of_treponema_pallidum_antibodies prothrombin_activity h_bs_ag mean_corpuscular_volume hematocrit white_blood_cell_count tumor_necrosis_factor_u_03b1 mean_corpuscular_hemoglobin_concentration fibrinogen interleukin_1ss urea lymphocyte_count ph_value red_blood_cell_count eosinophil_count corrected_calcium serum_potassium glucose neutrophils_count direct_bilirubin mean_platelet_volume ferritin rbc_distribution_width_sd thrombin_time x_lymphocyte hcv_antibody_quantification d_d_dimer total_cholesterol aspartate_aminotransferase uric_acid hco3 calcium amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp lactate_dehydrogenase platelet_large_cell_ratio interleukin_6 fibrin_degradation_products monocytes_count plt_distribution_width globulin x_u_03b3_glutamyl_transpeptidase international_standard_ratio basophil_count x2019_n_co_v_nucleic_acid_detection mean_corpuscular_hemoglobin activation_of_partial_thromboplastin_time high_sensitivity_c_reactive_protein hiv_antibody_quantification serum_sodium thrombocytocrit esr glutamic_pyruvic_transaminase e_gfr creatinine
Min. : 1.0 Min. :2020-01-10 19:45:00 Min. :18.00 Min. :1.000 Min. :2020-01-10 15:52:20 Min. :2020-01-23 09:09:23 Min. :0.0000 Min. : 1.9 Min. : 6.4 Min. : 71.5 Min. : 11.50 Min. : 0.0200 Min. :0.0000 Min. : 61.0 Min. : 17.00 Min. :13.60 Min. :0.0000 Min. : 5.00 Min. : 2.50 Min. : -1.0 Min. : 0.300 Min. : 20.00 Min. : 5.00 Min. : 0.100 Min. :10.60 Min. : 1.70 Min. :31.80 Min. : 0.0200 Min. : 6.00 Min. : 0.00 Min. : 61.60 Min. :14.50 Min. : 0.13 Min. : 4.0 Min. :286.0 Min. : 0.500 Min. : 5.000 Min. : 0.800 Min. : 0.0000 Min. :5.000 Min. : 0.100 Min. :0.00000 Min. :1.650 Min. : 2.760 Min. : 1.000 Min. : 0.060 Min. : 1.600 Min. : 8.50 Min. : 17.8 Min. : 31.30 Min. : 13.00 Min. : 0.00 Min. :0.02000 Min. : 0.210 Min. :0.10 Min. : 6.00 Min. : 43.0 Min. : 6.30 Min. :1.170 Min. : 5 Min. : 110 Min. :11.20 Min. : 1.50 Min. : 4.00 Min. : 0.0100 Min. : 8.00 Min. :10.10 Min. : 3.00 Min. : 0.840 Min. :0.00000 Min. :-1 Min. :20.40 Min. : 21.80 Min. : 0.10 Min. :0.05000 Min. :115.4 Min. :0.0100 Min. : 1.00 Min. : 5.0 Min. : 2.00 Min. : 11.0
1st Qu.: 92.0 1st Qu.:2020-02-04 13:46:00 1st Qu.:47.00 1st Qu.:1.000 1st Qu.:2020-02-01 00:06:16 1st Qu.:2020-02-13 19:06:26 1st Qu.:0.0000 1st Qu.: 3.7 1st Qu.:114.0 1st Qu.: 98.8 1st Qu.: 13.50 1st Qu.: 0.0400 1st Qu.:0.0000 1st Qu.: 585.0 1st Qu.: 54.00 1st Qu.:28.40 1st Qu.:0.1000 1st Qu.: 5.00 1st Qu.: 7.20 1st Qu.:121.0 1st Qu.: 3.100 1st Qu.: 84.00 1st Qu.: 12.60 1st Qu.: 3.600 1st Qu.:11.90 1st Qu.:65.10 1st Qu.:62.20 1st Qu.: 0.0400 1st Qu.: 70.00 1st Qu.: 0.00 1st Qu.: 86.80 1st Qu.:33.80 1st Qu.: 4.84 1st Qu.: 7.7 1st Qu.:334.0 1st Qu.: 3.400 1st Qu.: 5.000 1st Qu.: 3.840 1st Qu.: 0.5000 1st Qu.:6.000 1st Qu.: 3.710 1st Qu.:0.00000 1st Qu.:2.270 1st Qu.: 3.920 1st Qu.: 5.630 1st Qu.: 2.980 1st Qu.: 3.200 1st Qu.:10.20 1st Qu.: 582.5 1st Qu.: 38.40 1st Qu.: 15.80 1st Qu.: 4.50 1st Qu.:0.05000 1st Qu.: 0.510 1st Qu.:2.97 1st Qu.: 20.00 1st Qu.: 185.0 1st Qu.:21.00 1st Qu.:2.000 1st Qu.: 111 1st Qu.: 226 1st Qu.:26.60 1st Qu.: 13.33 1st Qu.: 4.70 1st Qu.: 0.2800 1st Qu.:11.30 1st Qu.:30.10 1st Qu.: 21.00 1st Qu.: 1.030 1st Qu.:0.01000 1st Qu.:-1 1st Qu.:29.70 1st Qu.: 36.40 1st Qu.: 8.70 1st Qu.:0.08000 1st Qu.:137.3 1st Qu.:0.1400 1st Qu.: 18.00 1st Qu.: 15.0 1st Qu.: 66.80 1st Qu.: 58.0
Median :185.0 Median :2020-02-09 12:50:00 Median :62.00 Median :1.000 Median :2020-02-04 15:53:12 Median :2020-02-17 21:50:30 Median :0.0000 Median : 12.9 Median :126.0 Median :101.7 Median : 14.30 Median : 0.1000 Median :0.1000 Median : 778.0 Median : 68.00 Median :33.05 Median :0.2000 Median : 7.50 Median : 10.30 Median :180.0 Median : 6.000 Median : 88.00 Median : 17.10 Median : 5.300 Median :12.50 Median :80.90 Median :66.70 Median : 0.0500 Median : 86.00 Median : 0.01 Median : 89.80 Median :36.90 Median : 7.33 Median : 8.7 Median :343.0 Median : 4.410 Median : 5.000 Median : 5.600 Median : 0.7900 Median :6.500 Median : 4.160 Median :0.01000 Median :2.360 Median : 4.330 Median : 6.960 Median : 5.420 Median : 4.600 Median :10.80 Median : 826.8 Median : 40.60 Median : 16.70 Median :12.40 Median :0.06000 Median : 1.350 Median :3.59 Median : 28.50 Median : 243.4 Median :23.20 Median :2.100 Median : 332 Median : 338 Median :31.40 Median : 25.36 Median : 7.40 Median : 0.4000 Median :12.60 Median :33.10 Median : 33.00 Median : 1.100 Median :0.01000 Median :-1 Median :30.90 Median : 39.40 Median : 51.90 Median :0.09000 Median :140.1 Median :0.2000 Median : 31.00 Median : 23.0 Median : 89.20 Median : 76.0
Mean :184.8 Mean :2020-02-08 07:09:59 Mean :59.44 Mean :1.391 Mean :2020-02-03 18:57:56 Mean :2020-02-16 21:40:09 Mean :0.4747 Mean : 800.5 Mean :125.1 Mean :102.3 Mean : 15.51 Mean : 0.6811 Mean :0.5653 Mean : 910.1 Mean : 80.71 Mean :32.72 Mean :0.2012 Mean : 15.11 Mean : 15.74 Mean :187.6 Mean : 6.357 Mean : 88.11 Mean : 41.74 Mean : 6.789 Mean :12.99 Mean :77.12 Mean :66.24 Mean : 0.1453 Mean : 82.62 Mean : 4.91 Mean : 90.04 Mean :36.78 Mean : 12.39 Mean : 10.7 Mean :343.4 Mean : 4.476 Mean : 5.947 Mean : 8.359 Mean : 0.9744 Mean :6.434 Mean : 7.756 Mean :0.03407 Mean :2.351 Mean : 4.408 Mean : 8.643 Mean : 7.429 Mean : 8.983 Mean :10.98 Mean : 1288.9 Mean : 42.05 Mean : 17.76 Mean :15.73 Mean :0.09682 Mean : 6.297 Mean :3.65 Mean : 41.89 Mean : 271.4 Mean :23.01 Mean :2.094 Mean : 1920 Mean : 453 Mean :32.32 Mean : 73.12 Mean : 36.54 Mean : 0.4894 Mean :13.19 Mean :33.50 Mean : 54.76 Mean : 1.235 Mean :0.01592 Mean :-1 Mean :30.93 Mean : 40.58 Mean : 75.75 Mean :0.09457 Mean :140.6 Mean :0.2063 Mean : 34.55 Mean : 34.5 Mean : 83.74 Mean : 99.1
3rd Qu.:270.0 3rd Qu.:2020-02-13 10:36:00 3rd Qu.:71.00 3rd Qu.:2.000 3rd Qu.:2020-02-09 02:06:58 3rd Qu.:2020-02-19 13:30:26 3rd Qu.:1.0000 3rd Qu.: 38.6 3rd Qu.:138.0 3rd Qu.:104.6 3rd Qu.: 15.80 3rd Qu.: 0.3100 3rd Qu.:0.7000 3rd Qu.:1026.0 3rd Qu.: 91.00 3rd Qu.:37.30 3rd Qu.:0.3000 3rd Qu.: 9.90 3rd Qu.: 15.40 3rd Qu.:245.0 3rd Qu.: 8.800 3rd Qu.: 92.00 3rd Qu.: 27.10 3rd Qu.: 7.700 3rd Qu.:13.50 3rd Qu.:91.60 3rd Qu.:70.80 3rd Qu.: 0.0600 3rd Qu.: 96.00 3rd Qu.: 0.01 3rd Qu.: 93.40 3rd Qu.:40.10 3rd Qu.: 12.15 3rd Qu.: 10.4 3rd Qu.:351.0 3rd Qu.: 5.410 3rd Qu.: 5.000 3rd Qu.: 9.625 3rd Qu.: 1.2800 3rd Qu.:6.500 3rd Qu.: 4.603 3rd Qu.:0.05000 3rd Qu.:2.430 3rd Qu.: 4.780 3rd Qu.: 9.780 3rd Qu.:10.450 3rd Qu.: 7.500 3rd Qu.:11.60 3rd Qu.: 1185.9 3rd Qu.: 44.10 3rd Qu.: 17.90 3rd Qu.:24.60 3rd Qu.:0.08000 3rd Qu.:11.610 3rd Qu.:4.21 3rd Qu.: 43.00 3rd Qu.: 328.0 3rd Qu.:25.50 3rd Qu.:2.190 3rd Qu.: 843 3rd Qu.: 574 3rd Qu.:37.60 3rd Qu.: 46.28 3rd Qu.: 25.80 3rd Qu.: 0.5800 3rd Qu.:14.60 3rd Qu.:36.52 3rd Qu.: 57.00 3rd Qu.: 1.250 3rd Qu.:0.02000 3rd Qu.:-1 3rd Qu.:32.10 3rd Qu.: 43.40 3rd Qu.:118.10 3rd Qu.:0.10000 3rd Qu.:142.7 3rd Qu.:0.2600 3rd Qu.: 43.00 3rd Qu.: 38.0 3rd Qu.:105.00 3rd Qu.: 97.0
Max. :375.0 Max. :2020-02-18 17:49:00 Max. :95.00 Max. :2.000 Max. :2020-02-17 21:30:07 Max. :2020-03-04 16:21:51 Max. :1.0000 Max. :50000.0 Max. :178.0 Max. :140.4 Max. :120.00 Max. :57.1700 Max. :8.6000 Max. :7500.0 Max. :620.00 Max. :48.60 Max. :1.7000 Max. :1000.00 Max. :505.70 Max. :558.0 Max. :53.000 Max. :136.00 Max. :6795.00 Max. :145.100 Max. :27.10 Max. :98.90 Max. :88.70 Max. :11.9500 Max. :142.00 Max. :250.00 Max. :118.90 Max. :52.30 Max. :1726.60 Max. :168.0 Max. :514.0 Max. :10.780 Max. :88.500 Max. :68.400 Max. :52.4200 Max. :7.565 Max. :749.500 Max. :0.49000 Max. :2.790 Max. :12.800 Max. :43.010 Max. :33.880 Max. :360.600 Max. :15.00 Max. :50000.0 Max. :113.30 Max. :161.90 Max. :60.00 Max. :2.09000 Max. :60.000 Max. :7.30 Max. :1858.00 Max. :1176.0 Max. :36.30 Max. :2.620 Max. :70000 Max. :1867 Max. :62.20 Max. :5000.00 Max. :190.80 Max. :39.9200 Max. :25.30 Max. :50.60 Max. :732.00 Max. :13.480 Max. :0.12000 Max. :-1 Max. :50.80 Max. :144.00 Max. :320.00 Max. :0.27000 Max. :179.7 Max. :0.5100 Max. :110.00 Max. :1600.0 Max. :224.00 Max. :1497.0

Correlation between the attributes

Age appears to be positively correlated with death from the disease.

A slight negative correlation between the length of the hospital stay and death can be seen. This is consistent with what has been noted in the overview of the data.

The plot below is limited to attributes with a correlation (positive or negative) of at least 70% in absolute value; otherwise it would be unreadable. The names have been abbreviated for the same reason. Please refer to the table below for the original names of the attributes.

A negative correlation can be seen between death and the lymphocyte sample results.

abbreviated_name original_name
death death
hemglbn hemoglobin
srm_chl serum_chloride
prthrmbn_t prothrombin_time
esnphls eosinophils
albumin albumin
ttl_blr total_bilirubin
pltlt_c platelet_count
moncyts monocytes
indrct_ indirect_bilirubin
rd_b___ red_blood_cell_distribution_width
ntrphls neutrophils
mn_crpsclr_v mean_corpuscular_volume
hemtcrt hematocrit
urea urea
lymphc_ lymphocyte_count
esnphl_ eosinophil_count
ntrphl_ neutrophils_count
drct_bl direct_bilirubin
mn_plt_ mean_platelet_volume
rbc_d__ rbc_distribution_width_sd
x_lymph x_lymphocyte
d_d_dmr d_d_dimer
asprtt_ aspartate_aminotransferase
calcium calcium
pltl___ platelet_large_cell_ratio
fbrn_d_ fibrin_degradation_products
mncyts_ monocytes_count
plt_ds_ plt_distribution_width
intrn__ international_standard_ratio
mn_crpsclr_h mean_corpuscular_hemoglobin
srm_sdm serum_sodium
thrmbcy thrombocytocrit
gltmc__ glutamic_pyruvic_transaminase
e_gfr e_gfr
creatnn creatinine
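The 70% filter used for the plot can be sketched as below. The sketch assumes Pearson correlation and keeps an attribute when it reaches the threshold with at least one other attribute; both are assumptions, since the text names neither the coefficient nor the exact rule. The function names and the toy values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongly_correlated(columns, threshold=0.7):
    """Names of attributes with |r| >= threshold against at least one other attribute."""
    names = sorted(columns)
    keep = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(columns[first], columns[second])) >= threshold:
                keep.update((first, second))
    return sorted(keep)


# Invented toy values, five hypothetical patients.
columns = {
    "death": [0, 0, 1, 1, 1],
    "x_lymphocyte": [30.0, 25.0, 8.0, 5.0, 3.0],
    "serum_sodium": [140.0, 138.0, 141.0, 139.0, 140.0],
}
print(strongly_correlated(columns))  # ['death', 'x_lymphocyte']
```

In the toy data the lymphocyte values fall as death goes from 0 to 1, so the pair passes the filter with a strongly negative coefficient, while serum_sodium is dropped.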

Survival over time

Classification

A random forest is used for the classification. For this purpose, the dataset is transformed into the following per-patient form:

variable                                   Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
age                                        18.00   46.00    62.00   58.83   70.00    95.00
gender                                     1.000   1.000    1.000   1.403   2.000    2.000
x_lymphocyte_min                           0.00    4.60     12.40   15.30   23.65    55.00
x_lymphocyte_max                           0.00    6.65     14.10   17.47   25.90    60.00
lactate_dehydrogenase_min                  110.0   231.0    338.0   420.6   518.5    1867.0
lactate_dehydrogenase_max                  119.0   248.5    340.0   481.4   594.0    1867.0
high_sensitivity_c_reactive_protein_min    0.10    10.35    51.70   64.34   97.10    320.00
high_sensitivity_c_reactive_protein_max    0.10    16.95    53.10   81.79   131.15   320.00

death: no 201, yes 174

For each patient, aside from their age and gender, the minimum and maximum values (observed during their hospital stay) of

  • lactic dehydrogenase (LDH),
  • lymphocyte,
  • and high-sensitivity C-reactive protein (hs-CRP)

are considered. As mentioned in the introduction of this analysis, the choice of biomarkers is based on the article by Tan et al.

Only samples taken within the 5-day window after each patient’s admission (63% of all samples) were selected for learning. This simulates classifying a patient who has already been hospitalized for a few days. Since most patients are hospitalized for more than 5 days (the median stay is about 9.6 days), such a prediction could still be useful.

70% of the dataset is used as training data and the remaining 30% as test data.
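The per-patient features can be sketched as below: keep the samples inside the window, then take the minimum and maximum of a biomarker. This is an illustrative Python sketch rather than the original code. The function name, the record layout, and the inclusive window boundaries are assumptions, and the sample values are invented.

```python
from datetime import datetime, timedelta


def window_features(admission, samples, biomarker, days=5):
    """Min and max of one biomarker among samples taken within `days` of admission.

    Returns (None, None) when the window contains no value for the biomarker.
    """
    cutoff = admission + timedelta(days=days)
    values = [
        sample[biomarker]
        for sample in samples
        # Assumption: both ends of the window are inclusive.
        if admission <= sample["re_date"] <= cutoff
        and sample.get(biomarker) is not None
    ]
    if not values:
        return None, None
    return min(values), max(values)


# Invented samples for one hypothetical patient.
admission = datetime(2020, 2, 1, 10, 0)
samples = [
    {"re_date": datetime(2020, 2, 1, 12, 0), "lactate_dehydrogenase": 320.0},
    {"re_date": datetime(2020, 2, 4, 9, 0), "lactate_dehydrogenase": 410.0},
    {"re_date": datetime(2020, 2, 9, 9, 0), "lactate_dehydrogenase": 900.0},
]
ldh_min, ldh_max = window_features(admission, samples, "lactate_dehydrogenase")
print(ldh_min, ldh_max)  # 320.0 410.0
```

The third sample is taken eight days after admission and is therefore ignored, even though it holds the highest value.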
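How the split was drawn is not stated. One possibility, assumed here, is a stratified split that takes 70% of each death class and rounds up. The sketch below (the function name and seed are hypothetical) shows that this reading is consistent with the sample counts that appear in the model output: 263 training samples and 112 test samples, of which 60 are "no" and 52 are "yes".

```python
import math
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into train and test sets, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: the training share of each class is rounded up.
        cut = math.ceil(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Class sizes from the dataset overview: 201 survivors, 174 deaths.
labels = ["no"] * 201 + ["yes"] * 174
train, test = stratified_split(labels)
test_counts = {label: sum(labels[i] == label for i in test) for label in ("no", "yes")}
print(len(train), len(test), test_counts)  # 263 112 {'no': 60, 'yes': 52}
```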

## Random Forest 
## 
## 263 samples
##   8 predictor
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 236, 237, 237, 237, 236, 237, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8797090  0.7581506
##    3    0.8728734  0.7443456
##    4    0.8785368  0.7558239
##    5    0.8730993  0.7446359
##    6    0.8777717  0.7543753
##    7    0.8775438  0.7541035
##    8    0.8774298  0.7532245
##    9    0.8768315  0.7525498
##   10    0.8889967  0.7771683
##   11    0.8722731  0.7435351
##   12    0.8875722  0.7743328
##   13    0.8775458  0.7532014
##   14    0.8829263  0.7649736
##   15    0.8685674  0.7362131
##   16    0.8836691  0.7663287
##   17    0.8843244  0.7676563
##   18    0.8776313  0.7540705
##   19    0.8767745  0.7526481
##   20    0.8738970  0.7467661
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  52   0
##        yes  8  52
##                                           
##                Accuracy : 0.9286          
##                  95% CI : (0.8641, 0.9687)
##     No Information Rate : 0.5357          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8579          
##                                           
##  Mcnemar's Test P-Value : 0.01333         
##                                           
##               Precision : 1.0000          
##                  Recall : 0.8667          
##                      F1 : 0.9286          
##              Prevalence : 0.5357          
##          Detection Rate : 0.4643          
##    Detection Prevalence : 0.4643          
##       Balanced Accuracy : 0.9333          
##                                           
##        'Positive' Class : no              
## 
## rf variable importance
## 
##                                         Overall
## lactate_dehydrogenase_max               66.8163
## age                                     19.3024
## x_lymphocyte_min                        13.3378
## x_lymphocyte_max                        12.8294
## high_sensitivity_c_reactive_protein_min  8.0129
## high_sensitivity_c_reactive_protein_max  6.0812
## lactate_dehydrogenase_min                3.8810
## gender                                   0.1193
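The resampling line in the output above (10-fold cross-validation, repeated 5 times) can be illustrated with a small index generator. This is a simplified sketch with a hypothetical function name: it shuffles and slices the indices without stratifying by class, so it reproduces the fold sizes but not necessarily the exact partitioning behind the reported numbers.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=5, seed=0):
    """Yield (train_indices, held_out_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Every k-th index goes to the same fold, so fold sizes differ by at most one.
        folds = [order[start::k] for start in range(k)]
        for held_out in folds:
            held = set(held_out)
            train = [index for index in order if index not in held]
            yield train, held_out


splits = list(repeated_kfold(263))
train_sizes = sorted({len(train) for train, _ in splits})
print(len(splits), train_sizes)  # 50 [236, 237]
```

With 263 training samples, each held-out fold has 26 or 27 samples, which gives the training sizes of 236 and 237 listed in the summary of sample sizes.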

To evaluate the above model we need to know what we want to achieve with this classification.

Precision

Let’s assume that we want to identify patients who are likely to die, so that we can try to save them. In the output above, the death class value “no” is the positive class: patients predicted as “no” need no special attention, while patients predicted as “yes” do. A precision of 1 on the test data is therefore a good sign: every patient the model predicted to survive did survive, so no patient who died was missed.

Recall

The recall of 0.87 could be an issue during a pandemic. With “no” as the positive class, the false negatives are the 8 surviving patients whom the model predicted to die. Since the medical system is already strained, flagging more patients than necessary as needing immediate care could waste precious resources.
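The reported statistics can be recomputed from the confusion matrix, which makes the roles of the two classes explicit. The sketch below uses the standard definitions with “no” as the positive class; the function name is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    # Agreement expected by chance, needed for Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": accuracy,
        "balanced_accuracy": (recall + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Positive class "no": 52 survivors predicted "no" (TP), 0 deaths predicted "no" (FP),
# 8 survivors predicted "yes" (FN), 52 deaths predicted "yes" (TN).
metrics = binary_metrics(tp=52, fp=0, fn=8, tn=52)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

If “yes” were treated as the positive class instead, the two off-diagonal counts would swap roles: precision would become 52/60 and recall would become 1.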

Summary

Age

The elderly are more likely to die.

Biomarkers

High lactic dehydrogenase values measured within 5 days of admission are associated with a higher risk of death, as are low lymphocyte values. Among the biomarkers mentioned by Tan et al., these two were the most important in the model.